Week 06
Simple Linear Regression

SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research


Semester 1, 2026
Last updated: 2026-01-23

Francesco Bailo

Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.

Learning Objectives

By the end of this lecture, you will be able to:

  • Fit and interpret simple linear regression models
  • Understand the least squares criterion
  • Assess model fit with R-squared
  • Make predictions from fitted models
  • Use simulation to understand regression

Key Readings

TSwD: Ch 12.1-12.2 | ROS: Ch 6-7

Introduction to Linear Models

Why Linear Models?

Linear models have been used for centuries to understand relationships in data.

Historical origins:

  • 1700s: Astronomers tracking celestial motion
  • Early statisticians were comfortable combining observations
  • Social scientists were slower to adopt (worried about grouping unlike data)

Modern uses:

  • Prediction and forecasting
  • Understanding relationships
  • Comparing groups
  • Estimating effects (with caution!)

What Regression Does

At a fundamental level, regression has two purposes:

  1. Prediction: Predict an outcome variable given some inputs
  2. Comparison: Compare predictions for different values of inputs

Key Insight

Regression is fundamentally a technology for prediction and comparison - not necessarily for identifying causal effects.

The Basic Regression Model

The simplest regression model is linear with a single predictor:

\[y = a + bx + \epsilon\]

Where:

  • \(y\) is the outcome (dependent variable)
  • \(x\) is the predictor (independent variable)
  • \(a\) is the intercept (value of \(y\) when \(x = 0\))
  • \(b\) is the slope (change in \(y\) for one unit change in \(x\))
  • \(\epsilon\) is the error (what the model doesn’t explain)

Visualising the Model

Least Squares Estimation

Finding the Best Line

How do we find the “best” line through the data?

Many lines could be drawn, but we want the one that fits best.

Criterion: Minimise the Residual Sum of Squares (RSS)

\[RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}e_i^2\]

Where:

  • \(\hat{y}_i\) is the predicted value for observation \(i\)
  • \(e_i = y_i - \hat{y}_i\) is the residual (prediction error)
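Residuals and the RSS are easy to compute directly. Here is a minimal base-R sketch with made-up numbers (the data are invented purely for illustration):

```r
# Toy data, invented for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

y_hat <- fitted(fit)   # predicted values, y-hat
e     <- y - y_hat     # residuals: e_i = y_i - y_hat_i
rss   <- sum(e^2)      # residual sum of squares

# residuals() returns the same values
all.equal(unname(e), unname(residuals(fit)))
```

Any other line through these points would produce a larger RSS; that is exactly what the least squares criterion means.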

What Are Residuals?

The Least Squares Solution

For simple linear regression, the least squares estimates are:

\[\hat{b} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\]

\[\hat{a} = \bar{y} - \hat{b}\bar{x}\]

Good News!

You don’t need to calculate these by hand - R does it for you with lm()
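Still, computing the estimates once by hand is a useful sanity check. A base-R sketch on simulated data (the parameter values here are chosen arbitrarily):

```r
# Least squares "by hand", checked against lm() (simulated data)
set.seed(1)
x <- runif(50, min = 15, max = 30)
y <- 5 + 8 * x + rnorm(50, mean = 0, sd = 10)

# The formulas from the previous slide
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)

fit <- lm(y ~ x)
c(a_hat, b_hat)   # hand-computed estimates
coef(fit)         # lm() gives the same numbers
```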

Fitting Regression in R

The lm() Function

In R, we use lm() (linear model) to fit regressions:

model <- lm(outcome ~ predictor, data = dataset)

Key components:

  • outcome ~ predictor: The formula (outcome on left, predictor on right)
  • data = dataset: The data frame containing your variables
  • Returns a model object you can examine

Example: Running Times

Let’s simulate data about the relationship between 5km run time and marathon time:

set.seed(853)
num_observations <- 200
expected_relationship <- 8.4  # Marathon is ~8.4x longer than 5km

sim_run_data <- tibble(
  five_km_time = runif(n = num_observations, min = 15, max = 30),
  noise = rnorm(n = num_observations, mean = 0, sd = 20),
  marathon_time = five_km_time * expected_relationship + noise
) |>
  mutate(
    five_km_time = round(five_km_time, 1),
    marathon_time = round(marathon_time, 1)
  ) |>
  select(-noise)

Visualising the Data

ggplot(sim_run_data, 
       aes(x = five_km_time, 
           y = marathon_time)) +
  geom_point(alpha = 0.5) +
  labs(
    x = "5km time (minutes)",
    y = "Marathon time (minutes)"
  ) +
  theme_classic(base_size = 18)

Fitting the Model

run_model <- lm(marathon_time ~ five_km_time, data = sim_run_data)
summary(run_model)

Call:
lm(formula = marathon_time ~ five_km_time, data = sim_run_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.289 -11.948   0.153  11.396  46.511 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.4692     6.7517   0.662    0.509    
five_km_time   8.2049     0.3005  27.305   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.42 on 198 degrees of freedom
Multiple R-squared:  0.7902,    Adjusted R-squared:  0.7891 
F-statistic: 745.5 on 1 and 198 DF,  p-value: < 2.2e-16

Understanding the Output

Coefficients:

  • Intercept (4.47): Expected marathon time if 5km time were 0 (not meaningful!)
  • Slope (8.20): For each additional minute in 5km time, marathon time increases by ~8.2 minutes

Model fit:

  • Residual SE (17.42): Typical prediction error
  • R-squared (0.79): 79% of variance explained
  • p-value (<2e-16): Strong evidence of a relationship

We recovered the truth!

True slope was 8.4, estimate is 8.2 ± 0.3 (standard error)

Adding the Regression Line

ggplot(sim_run_data, 
       aes(x = five_km_time, 
           y = marathon_time)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", 
              se = TRUE,
              colour = "steelblue") +
  labs(
    x = "5km time (minutes)",
    y = "Marathon time (minutes)"
  ) +
  theme_classic(base_size = 18)

Extracting Coefficients

# Get coefficients
coef(run_model)
 (Intercept) five_km_time 
    4.469242     8.204932 
# Get just the slope
coef(run_model)["five_km_time"]
five_km_time 
    8.204932 

The regression equation is:

\[\text{Marathon time} = 4.47 + 8.20 \times \text{5km time}\]

Interpreting Coefficients

Coefficients as Comparisons

Critical Insight

Regression coefficients are commonly called “effects,” but this can be misleading. We should think of them as comparisons, not causal effects.

What the slope really means:

“Comparing runners whose 5km times differ by one minute, we find their marathon times differ, on average, by about 8.2 minutes.”

This is a between-person comparison, not a within-person effect!

Example: Height and Earnings

Consider a regression predicting earnings from height and sex:

\[\text{earnings} = -26.0 + 0.6 \times \text{height} + 10.6 \times \text{male}\]

Tempting but wrong interpretations:

  • ❌ “The effect of height on earnings is $600 per inch”
  • ❌ “Being male causes $10,600 higher earnings”

Better interpretations:

  • ✓ “Comparing people of the same sex, those who differ by one inch in height differ on average by $600 in earnings”
  • ✓ “Comparing people of the same height, men earn on average $10,600 more than women”

Why This Matters

The height-earnings regression shows:

  • Taller people earn more (observational pattern)
  • This does NOT mean making someone taller would increase their earnings

Possible explanations (all consistent with the data):

  • Discrimination against shorter people
  • Height correlated with confidence
  • Height correlated with nutrition/health in childhood
  • Height correlated with social class
  • Some combination of all the above

Bottom Line

Regression tells us about associations, not causes. Causal interpretation requires additional assumptions and designs (Week 12).

Model Assessment

Residual Standard Deviation (σ)

The residual standard deviation tells us about prediction accuracy:

sigma(run_model)
[1] 17.42303

Interpretation:

  • About 68% of predictions are within ±17.4 minutes of actual marathon time
  • About 95% are within ±34.8 minutes

This comes from the normal distribution properties we learned in earlier weeks.
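We can check this empirically. The sketch below re-simulates data with the same parameters as the running example, so it does not depend on `run_model` existing in your session:

```r
# Empirical check of the 68%/95% rule for residuals (re-simulated data)
set.seed(853)
x <- runif(200, min = 15, max = 30)
y <- 8.4 * x + rnorm(200, mean = 0, sd = 20)
fit <- lm(y ~ x)

s <- sigma(fit)
mean(abs(residuals(fit)) <= s)       # should be near 0.68
mean(abs(residuals(fit)) <= 2 * s)   # should be near 0.95
```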

R-squared: Proportion of Variance Explained

\[R^2 = 1 - \frac{\text{Variance of residuals}}{\text{Variance of outcome}}\]

# Calculate R-squared
summary(run_model)$r.squared
[1] 0.7901541
# Or equivalently:
1 - var(residuals(run_model)) / var(sim_run_data$marathon_time)
[1] 0.7901541

Interpretation: 79% of the variation in marathon times is explained by 5km times.

Interpreting R-squared

What R² tells us:

  • How much variance is “explained”
  • Relative fit compared to a model with no predictors

Useful for:

  • Comparing models
  • Assessing predictive power

What R² doesn’t tell us:

  • Whether model is correct
  • Whether relationship is causal
  • Whether predictions are accurate enough

Be cautious:

  • Can be inflated by adding predictors
  • Not a measure of model “validity”

Examining Residuals

What we want: Residuals centred around zero, roughly symmetric, no patterns
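One caution: with an intercept in the model, OLS forces the residual mean and the residual-fitted correlation to be exactly zero, so summary statistics cannot reveal problems. A base-R sketch (re-simulated data) makes the point:

```r
# These summaries are ~0 by construction, so they diagnose nothing
set.seed(853)
x <- runif(200, min = 15, max = 30)
y <- 8.4 * x + rnorm(200, mean = 0, sd = 20)
fit <- lm(y ~ x)

mean(residuals(fit))               # ~0 for any OLS fit with an intercept
cor(residuals(fit), fitted(fit))   # ~0 by construction

# Curvature and funnel shapes only show up visually:
# plot(fitted(fit), residuals(fit)); abline(h = 0)
```

Because these numbers are always near zero, the residual plot (commented above) is the genuinely informative diagnostic.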

Making Predictions

Using predict()

Once we have a fitted model, we can make predictions:

# Predict marathon time for someone with a 20-minute 5km
new_runner <- tibble(five_km_time = 20)
predict(run_model, newdata = new_runner)
       1 
168.5679 
# Prediction with confidence interval
predict(run_model, newdata = new_runner, interval = "confidence")
       fit      lwr      upr
1 168.5679 165.8405 171.2953

Prediction vs Confidence Intervals

# Confidence interval: uncertainty about the LINE
predict(run_model, newdata = new_runner, interval = "confidence")
       fit      lwr      upr
1 168.5679 165.8405 171.2953
# Prediction interval: uncertainty about INDIVIDUAL predictions
predict(run_model, newdata = new_runner, interval = "prediction")
       fit      lwr      upr
1 168.5679 134.1013 203.0345

Key difference:

  • Confidence interval: Where is the average marathon time for all 20-minute 5km runners?
  • Prediction interval: What marathon time might this specific runner achieve?
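Where does the extra width come from? The prediction interval adds the residual variance \(\sigma^2\) to the uncertainty about the line itself. A base-R sketch, again re-simulating the running data:

```r
# Prediction interval "by hand": add residual variance to line uncertainty
set.seed(853)
x <- runif(200, min = 15, max = 30)
y <- 8.4 * x + rnorm(200, mean = 0, sd = 20)
fit <- lm(y ~ x)
new <- data.frame(x = 20)

pr      <- predict(fit, newdata = new, se.fit = TRUE)
se_pred <- sqrt(pr$se.fit^2 + sigma(fit)^2)   # individual-level SE
t_crit  <- qt(0.975, df = df.residual(fit))

c(pr$fit - t_crit * se_pred, pr$fit + t_crit * se_pred)
predict(fit, newdata = new, interval = "prediction")   # same bounds
```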

Visualising Prediction Intervals

The broom Package

Tidy Model Output

The broom package provides three key functions for working with models:

library(broom)

  • tidy(): coefficient estimates (one row per term)
  • glance(): model-level statistics (one row per model)
  • augment(): observation-level statistics (the original data plus fitted values and residuals)

tidy(): Coefficient Table

tidy(run_model, conf.int = TRUE)
# A tibble: 2 × 7
  term         estimate std.error statistic  p.value conf.low conf.high
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)      4.47     6.75      0.662 5.09e- 1    -8.85     17.8 
2 five_km_time     8.20     0.300    27.3   4.70e-69     7.61      8.80

This is much easier to work with than raw summary() output!

glance(): Model Summary

glance(run_model)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.790         0.789  17.4      746. 4.70e-69     1  -854. 1715. 1725.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Key columns:

  • r.squared: Proportion of variance explained
  • sigma: Residual standard deviation
  • statistic, p.value: F-test for overall model significance

augment(): Observation-Level Data

augment(run_model) |> 
  head()
# A tibble: 6 × 8
  marathon_time five_km_time .fitted  .resid    .hat .sigma   .cooksd .std.resid
          <dbl>        <dbl>   <dbl>   <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
1          164.         20.4    172.  -8.05  0.00585   17.5   6.32e-4    -0.463 
2          158          16.8    142.  15.7   0.0133    17.4   5.55e-3     0.906 
3          196.         22.3    187.   8.16  0.00501   17.5   5.55e-4     0.470 
4          160.         19.7    166.  -5.81  0.00670   17.5   3.77e-4    -0.334 
5          121.         15.6    132. -11.6   0.0175    17.4   4.00e-3    -0.670 
6          178.         21.1    178.   0.607 0.00529   17.5   3.24e-6     0.0349

Key columns: .fitted (predicted values), .resid (residuals), .hat (leverage), .cooksd (influence)

Simulation for Understanding

Why Simulate?

Fake-Data Simulation

Simulating data where we know the truth helps us:

  1. Check that our methods work correctly
  2. Understand what our estimates mean
  3. Explore properties of regression

“The most valuable benefit of doing fake-data simulation is that it helps you build and then understand your statistical model.” — Gelman, Hill & Vehtari (2020)

Simulation Example: Elections

# Step 1: Set true parameters
a <- 46.3  # True intercept
b <- 3.0   # True slope
sigma <- 3.9  # True residual SD

# Step 2: Generate fake data
x <- c(0.1, 3.2, 2.9, 3.8, 1.3, 4.0, 2.2, 1.0, 2.7, 0.7, 3.9, 2.6, 1.9, 1.5, 3.4, 2.0)
n <- length(x)

# Step 3: Simulate outcomes
set.seed(123)
y <- a + b * x + rnorm(n, 0, sigma)

Now we have data where we know the true relationship is \(y = 46.3 + 3.0x + \epsilon\)

Fit Model to Fake Data

fake_data <- tibble(growth = x, vote = y)
fake_model <- lm(vote ~ growth, data = fake_data)
tidy(fake_model)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    44.1      1.84      23.9  9.35e-13
2 growth          4.36     0.710      6.14 2.54e- 5

How did we do?

  • Intercept: true value 46.3, estimate 44.1 (SE 1.84), within 2 SE ✓
  • Slope: true value 3.0, estimate 4.36 (SE 0.71), within 2 SE ✓

Repeated Simulation: Coverage

If we repeat this process many times, we expect:

  • 68% of 68% confidence intervals to contain the true value
  • 95% of 95% confidence intervals to contain the true value

# Conceptual code (takes time to run)
n_sims <- 1000
cover_95 <- rep(NA, n_sims)

for (s in 1:n_sims) {
  y_sim <- a + b * x + rnorm(n, 0, sigma)
  fit <- lm(y_sim ~ x)
  ci <- confint(fit)["x", ]
  cover_95[s] <- (ci[1] <= b) & (b <= ci[2])
}
mean(cover_95)  # Should be approximately 0.95

Regression to the Mean

A Historical Puzzle

Francis Galton noticed something curious about height:

  • Children of tall parents tend to be taller than average…

  • …but shorter than their parents

  • Children of short parents tend to be shorter than average…

  • …but taller than their parents

This is regression to the mean - where the term “regression” comes from!

Why Does This Happen?

The Resolution

Apparent paradox: If heights regress to the mean, won’t variation disappear?

Resolution:

  • The point prediction regresses toward the mean (slope < 1)
  • But the error adds variation back
  • Net result: variance stays approximately constant across generations

Key Insight

Regression to the mean occurs whenever predictions are imperfect. It’s a mathematical fact, not a causal process.
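The resolution above can be simulated directly: a point prediction with slope below 1, plus fresh noise, leaves the next generation's variance unchanged. A sketch with invented, standardised parameters:

```r
# Regression toward the mean without shrinking variance (invented parameters)
set.seed(7)
n <- 10000
parent <- rnorm(n, mean = 0, sd = 1)

# Point prediction regresses (slope 0.5 < 1); noise adds variation back
child <- 0.5 * parent + rnorm(n, mean = 0, sd = sqrt(1 - 0.5^2))

c(var(parent), var(child))   # both close to 1
cor(parent, child)           # close to 0.5
```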

The Regression Fallacy

Consider students taking midterm and final exams:

  • Students who do well on the midterm tend to do worse on the final
  • Students who do poorly on the midterm tend to do better on the final

Wrong interpretation: High performers get lazy, low performers work harder

Correct interpretation: This is regression to the mean - both exams measure ability imperfectly, and extreme scores tend to be partly due to luck
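The exam story can itself be simulated: two noisy measurements of the same underlying ability (the score scales below are invented):

```r
# Regression fallacy demo: two noisy exams measuring the same ability
set.seed(42)
n <- 1000
ability <- rnorm(n, mean = 70, sd = 8)
midterm <- ability + rnorm(n, mean = 0, sd = 6)  # luck on exam 1
final   <- ability + rnorm(n, mean = 0, sd = 6)  # independent luck on exam 2

top <- midterm >= quantile(midterm, 0.9)  # top 10% on the midterm
mean(midterm[top])   # far above the overall mean
mean(final[top])     # still above average, but pulled back toward it
```

No one "got lazy": the top midterm group was partly selected for good luck, and luck does not repeat on the final.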

Real Example

Flight instructors found pilots improved after criticism and got worse after praise. Actually, this was just regression to the mean - no causal effect of feedback!

Practical Tips

Common Mistakes to Avoid

Interpretation errors:

  • Calling coefficients “effects”
  • Ignoring uncertainty (SE)
  • Over-interpreting R²
  • Extrapolating beyond data

Technical issues:

  • Not checking residuals
  • Ignoring non-linearity
  • Forgetting about outliers
  • Confusing correlation with causation

Good Practice Summary

  1. Always visualise your data before and after fitting

  2. Interpret cautiously - use comparison language, not causal language

  3. Report uncertainty - coefficients without standard errors are incomplete

  4. Check assumptions - examine residuals for patterns

  5. Simulate - if unsure how something works, simulate it!

Summary

What we learned:

  • Regression finds the best-fit line
  • Minimises sum of squared residuals
  • Coefficients are comparisons
  • R² measures variance explained
  • Prediction has uncertainty

Key R functions:

  • lm() - fit models
  • summary(), coef() - examine results
  • predict() - make predictions
  • broom::tidy(), glance(), augment() - tidy output
  • residuals(), fitted() - diagnostics

Next Week

Week 7: Multiple Regression

  • Adding multiple predictors
  • Interpreting coefficients “controlling for” other variables
  • Categorical predictors (dummy variables)
  • Building and comparing models

Preparation

Read TSwD Ch 12.3-12.4 and ROS Ch 9-10

References